Skip to content

PHPLIB-1323 Implement unlink for GridFS stream wrapper #1206

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jan 12, 2024

Conversation

GromNaN
Copy link
Member

@GromNaN GromNaN commented Dec 15, 2023

Fix PHPLIB-1323

This is not optimized as it creates a WritableStream instance that is not really used.
The context resolver mechanism introduces in #1138 doesn't give direct access to the CollectionWrapper which is required for this feature.

@GromNaN GromNaN force-pushed the PHPLIB-1323 branch 2 times, most recently from 0a08f8e to 7cbf64e Compare December 18, 2023 10:03
@GromNaN GromNaN marked this pull request as ready for review December 18, 2023 10:03
@GromNaN GromNaN requested a review from jmikola December 18, 2023 10:03
Comment on lines 193 to 188
foreach ($revisions as $file) {
$this->collectionWrapper->findFileById($file->get('_id'));
$this->collectionWrapper->deleteFileAndChunksById($file->get('_id'));
$count++;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we can improve on this by collecting all IDs, then deleting all files with the given filename, followed by removing all chunks for the IDs we previously found.

While there's always a chance to end up in an undesirable state, e.g. by having older revisions of files remaining, the suggestion above would reduce this by instructing the server to delete all file documents with a given name, ensuring that future requests for such files would not produce any result. Thus, if there was an error while deleting any chunks, we wouldn't end with older revisions left over.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will implement a solution consistent with rename #1207 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would definitely benefit from some bulk strategy to remove the fs.files documents first (likely using $in on multiple _id values), and then cleaning up the orphan chunks. I think it'd also benefit from being addressed in the GridFS spec, since we'd probably want to have a consistent way to address this in any driver that wishes to do so. And codifying that in a spec would ensure we're not overlooking some edge case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented the bulk delete

  1. get all file ids by filename
  2. delete all files by id
  3. delete all chunks by file ids

The limitation is if there is more revisions that supported by a single command body.

@GromNaN GromNaN force-pushed the PHPLIB-1323 branch 2 times, most recently from cae12ee to 03f4ea2 Compare January 9, 2024 10:04
$count = $this->filesCollection->deleteMany(['_id' => ['$in' => $ids]])->getDeletedCount();
$this->chunksCollection->deleteMany(['files_id' => ['$in' => $ids]]);

return $count ?? 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is zero in case of WriteConcern=0 or not file to update.

Comment on lines 193 to 188
foreach ($revisions as $file) {
$this->collectionWrapper->findFileById($file->get('_id'));
$this->collectionWrapper->deleteFileAndChunksById($file->get('_id'));
$count++;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented the bulk delete

  1. get all file ids by filename
  2. delete all files by id
  3. delete all chunks by file ids

The limitation is if there is more revisions that supported by a single command body.

public function delete(): int
{
try {
return $this->collectionWrapper->deleteFileAndChunksByFilename($this->file['filename']);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is an internal class, so we're not actually introducing a public API here, but a delete() method that affects multiple revisions seems odd, which is otherwise an abstraction for writing a single revision.

In the PR description, you said:

The context resolver mechanism introduced in #1138 doesn't give direct access to the CollectionWrapper which is required for this feature.

If we look at stream_open, it accesses a CollectionWrapper after resolving the path to an array of context options. Couldn't the same approach be used here to access a CollectionWrapper directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt deep down that using ReadableStream and WritableStream wasn't correct. I refactored by adding a private getContext method to use CollectionWrapper directly for rename and unlink.

@GromNaN GromNaN force-pushed the PHPLIB-1323 branch 2 times, most recently from 2d9c795 to 1d8e3c4 Compare January 9, 2024 16:47
@GromNaN GromNaN requested a review from jmikola January 9, 2024 17:07
@@ -324,6 +295,14 @@ public function stream_write(string $data): int
return $this->stream->writeBytes($data);
}

public function unlink(string $path): bool
{
$context = $this->getContext($path, 'r');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bucket::resolveStreamContext() executes findFileByFilenameAndRevision() for read modes in order to fetch the file document. Would it make sense to use 'w' here and avoid the unnecessary query?

I understand rename() needs to use 'r' since it does read the filename from the file document; however, there may be a refactoring opportunity there if you'd just like to trust the $fromPath string. I suppose it's a matter of intentionally NOP-ing on a FileNotFoundException or executing an update query that may not match anything -- so I'd be find leaving rename() as-is.

Copy link
Member Author

@GromNaN GromNaN Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of using r mode was to check that the file exists before doing operations, to throw a FileNotFoundException. This could be optimized by checking the number of updated documents after the update, but that's not feasible if Write Concern = 0 (edge case).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted to throw a FileNotFoundException in rename and unlink, instead of silently returning false. This behavior is consistent with S3 stream wrapper.

@@ -57,6 +58,9 @@ revision is read (revision ``-1``).
In write mode, the stream context can contain the option ``gridfs['chunkSizeBytes']``.
If omitted, the defaults are inherited from the ``Bucket`` instance option.

The functions `rename` and `unlink` will rename or remove all revisions of a
filename. If the filename does not exist, these functions throw a ``FileNotFoundException``.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmikola removing or renaming a single revision is not supported. Even if the option gridfs[revision] is set.

$context = $this->getContext($fromPath, 'w');

$newFilename = explode('/', $toPath, 4)[3] ?? '';
$count = $context['collectionWrapper']->updateFilenameForFilename($context['filename'], $newFilename);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edge case: if $fromPath and $toPath are identical, the file exists but you'll throw FileNotFoundException. That may warrant reverting to an "r" mode when selecting the context to throw eagerly.

Alternatively, you could have updateFilenameForFilename() return the matched count and assume that the update was successful or a NOP. The only normal scenario where the matched and modified counts wouldn't be equals is if the paths were the same, in which case you could probably NOP earlier and avoid calling updateFilenameForFilename() altogether (FWIW, PHP's rename() returns true in this case, but I dunno if there's some internal NOP behavior).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I updated the code to return the matched count, and added a test for renaming with the same filename (existing and not existing).

@@ -338,6 +330,42 @@ public function url_stat(string $path, int $flags)
return $this->stream_stat();
}

/** @psalm-return ($mode == 'r' ? array{collectionWrapper: CollectionWrapper, file: object} : array{collectionWrapper: CollectionWrapper, filename: string, options: array}) */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is relevant, but the ternary here only checks for "r" and stream_open() may call this with "rb". Can you invoke functions here, or are you restricted on syntax here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logical or works. I also added the @return annotation so PHPStorm understands it.

@GromNaN GromNaN merged commit b700b10 into mongodb:master Jan 12, 2024
@GromNaN GromNaN deleted the PHPLIB-1323 branch January 12, 2024 12:37
alcaeus added a commit to alcaeus/mongo-php-library that referenced this pull request Jan 15, 2024
* master:
  PHPLIB-1323 Implement `unlink` for GridFS stream wrapper (mongodb#1206)
  PHPLIB-1330: Sync tests for failCommand errorLabels reqs (mongodb#1214)
  PHPLIB-1246: Test PHP 8.3 on Evergreen (mongodb#1213)
  PHPLIB-1324 Implement `rename` for GridFS stream wrapper  (mongodb#1207)
  PHPLIB-1248 Add examples on GridFS (mongodb#1196)
  Deprecate setting GridFS disableMD5 to false explicitly (mongodb#1205)
  PHPLIB-1326: Use more permissive top-level runOnRequirements (mongodb#1210)
  PHPLIB-1206 Add bucket alises for context resolver using GridFS StreamWrapper (mongodb#1138)
  Bump actions/upload-artifact from 3 to 4 (mongodb#1208)
  PHPLIB-1275: Replace apiargs usage in docs with extracts (mongodb#1203)
  Fix title formatting in Client::removeSubscriber() docs (mongodb#1204)
  PHPLIB-1304: Pull mongohouse image from ECR repo (mongodb#1202)
  Fix evergreen failures (mongodb#1200)
  Enable workflows to run for GitHub Merge Queue (mongodb#1199)
  PHPLIB-1313 Ensure the GridFS stream is saved when the script ends (mongodb#1197)
  PHPLIB-1309 Add addSubscriber/removeSubscriber to Client class to ease configuration (mongodb#1195)
  Master is now 1.18-dev
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants